This lesson will guide students through the process of merging two messy dataframes by a common key using merge.
Prerequisites
Teaching Objectives
By the end of this lecture, I want you to be able to:
Roadmap
Our goal today is to learn how to merge two tables, or dataframes, in R. There a number of different ways to do so, depending on what you want your final output to look like. Below, we will go through some of the most common methods to merge dataframes.
We will be using two mini dataframes as an example, each containing individuals from a iconic duo in media. Later on in the lecture we will be creating our own dataset of duos, so start thinking of your favorites!
If you would like to test these dataframes within R yourself, you can run this code to get the dataframes used in the examples. The results will be identical to the figures.
source("https://raw.githubusercontent.com/Epsian/table_join_lecture/main/src/2_merge_examples.R")
rbindIf you have two dataframes and want to combine them, you need to decide what direction you want to combine them in. If you would like to combine them vertically, or stack the rows on top of each other, you can use rbind or row bind. rbind is helpful if you have two dataframes with the same columns, and you want to combine the cases (rows). However, it will not work if the columns are different. rbind accepts an arbitrary number of dataframes, here we use two: upper_rbind_df and lower_rbind_df.
outcome = rbind(upper_rbind_df, lower_rbind_df)
rbind example 1
cbindIf you would like to combine two dataframes horizontally, so that you add more columns on to a dataframe, you can use cbind or column bind. The cbind function takes an arbitrary number of dataframes as it’s arguments. We’ll provide just our two here, but try repeating one multiple times if you would like to see what happens.
outcome = cbind(left_cbind_df, right_cbind_df)
cbind example 1
We can see out new outcome dataframe was created as expected. This method can work, but you must be very confident about the structure of your data. For example, in the following figure, what issues do you expect to run into with this method?
cbind example 2
mergecbind may work in some situations, but is very “brittle” – or easy to break – with any changes to the data. A more reliable method of merging makes use of a common key between two sources of data. Think of a key like a luggage tag on a suitcase: it’s a small bit of information that clearly links to things, even if they are separated. In terms of data, a key can be anything, like a numerical ID or a string, but they must always be unique. Joins that use the key method will be familiar to anyone who has used SQL in the past, as they follow the same naming convention as in that language.
An inner join tries to join two sets of data using a common key, keeping only the data that exists in both of the original data sets. This is commonly illustrated using a venn diagram similar to the one below. Only the area highlighted with green will be included in the output.
Inner Join - Venn Diagram
In the context of our data, it might look something like the following if we use hometown as our key. We can use the following code to do an inner merge using the merge function. The merge function in R requires two arguments, x and y, which are the dataframes we would like to merge. We will be using left_merge_df and right_merge_df, and asking merge to use hometown as the key to merge by.
outcome = merge(x = left_merge_df, y = right_merge_df, by = "hometown")
Inner Join - Example Data
Everything seems in order. We can see that when using an inner join, cases where both dataframes have a single row with our key will be matched and joined into a single output dataframe. However, if there are rows in either data frame without a match, those rows will be dropped from our data.
An Outer Join is the opposite of an inner join. Rather than just looking for those rows which have a key in common, it will join every row, regardless of the keys, inserting blank values where there is no match. You can get a sense of this merge with the venn diagram below, where the green indicates good matches, while the orange indicates partial matches.
Outer Join - Venn Diagram
In the context of our data, an outer join may look something like the following, again using hometown as our key. The code for an outer join is similar to an inner join, with one addition. We will want to specify that all = TRUE, essentially saying we want everything included.
outcome = merge(x = left_merge_df, y = right_merge_df, by = "hometown", all = TRUE)
Outer Join - Example Data
In this example, we see that an outer join will keep all available cases, but we introduce NA values into the resulting dataframe where no match could be made. This may or may not be an issue depending on what your next steps are.
A left join allows you to pick one of the two dataframes you are joining and prioritize it. It essentially takes all of the cases in the dataframe on the ‘left’ side, and searches in ‘right’ dataframe to join what it can. Keeping with the venn diagram representations, it would look like the following:
Left Join - Venn Diagram
Using our example data, a left join would look like this. The R code is similar, to the previous examples, but we specify that instead of all = TRUE, we just want all.x = TRUE. Recall that in the function call, x is the first dataframe, or the “left” one.
outcome = merge(x = left_merge_df, y = right_merge_df, by = "hometown", all.x = TRUE)
Left Join - Example Data
We can see that while all of the data from our ‘left’ side is preserved, unmatched data on the ‘right’ is discarded. You can also technically do a right join, which will do the same thing with sides reversed, but moving your prioritized data set to the left is more common.
If you were coding along, you can clean up our environment using the following code:
rm(outcome, upper_rbind_df, lower_rbind_df, left_cbind_df, right_cbind_df, left_merge_df, right_merge_df)
Checkpoint
Please use the Zoom reacts to respond. Green check is TRUE, red X is FALSE.
cbind checks rows for matches before combining dataframes.To practice with some of the tools we learned above, we will work with some example data together. The fun part is, the data will be generated by you! Please use the following link and fill out the Google form.
Class Data Generation
Workspace Preparation
Please run the following code to download our class data, and prepare it for the lesson.
# Load in student survey csv
raw_survey = read.csv("https://docs.google.com/spreadsheets/d/1E8ME8ldv8Pv4-saL-ZYHJckmVPNgcNd53LasYiW0jwU/export?format=csv&usp=sharing", header = TRUE, stringsAsFactors = FALSE)
# Let loose the data gremlins
# Don't look at this file until after lecture! Here there be spoilers.
source("https://raw.githubusercontent.com/Epsian/table_join_lecture/main/src/1_split_duos.R")
After sending out invitations via our Google form, all of our favorite duos from history and fiction have agreed to come to lecture. On their way, the group was beset by data gremlins, resulting in all of our duos being separated! Not only that, but the gremlins have introduced noise and mess into our data.
The individuals from our duos have regathered themselves in two locations (dataframes), california and massachusetts. Every duo wants to find their way back home, and to reunite with their opposite.
Let’s start by looking at the dataframes where the individuals from our duos ended up. Each dataframe has 4 columns: the name of the individual, a color, the name of their alter (pairing), and their hometown.
Starting with the california:
head(california, 10)
## name color alter hometown
## 1 Rory Gilmore Yellow Lorelai Gilmore STARS HOLLOW
## 2 Harry Potter Red <NA> HOGWARTS
## 3 Aang Orange Katara REPUBLIC CITY
## 4 Han Solo Blue <NA> CORELLIA
## 5 Mandy Red Billy EVILLE
## 6 Vash the Stampede Red <NA> A DESERT PLANET
## 7 Ash Ketchum Red Pikachu PALLET TOWN
## 8 Midge Violet <NA> NEW YORK
## 9 Link Green Zelda CASTLE TOWN
## 10 Batman Red <NA> GOTHAM
str(california)
## 'data.frame': 13 obs. of 4 variables:
## $ name : chr "Rory Gilmore" "Harry Potter" "Aang" "Han Solo" ...
## $ color : chr "Yellow" "Red" "Orange" "Blue" ...
## $ alter : chr "Lorelai Gilmore" NA "Katara" NA ...
## $ hometown: chr "STARS HOLLOW" "HOGWARTS" "REPUBLIC CITY" "CORELLIA" ...
And then the massachusetts:
head(massachusetts, 10)
## name color alter hometown
## 1 Lorelai Gilmore Orange Rory Gilmore Stars_Hollow
## 2 Hermione Granger Yellow <NA> Hogwarts
## 3 Katara Blue Aang Republic_City
## 4 Chewbacca Orange <NA> Corellia
## 5 Billy Blue Mandy Eville
## 6 Nicholas D. Wolfwood Blue <NA> A_Desert_Planet
## 7 Pikachu Yellow Ash Ketchum Pallet_Town
## 8 Susie Blue <NA> New_York
## 9 Zelda Violet Link Castle_Town
## 10 Robin Yellow <NA> Gotham
str(massachusetts)
## 'data.frame': 13 obs. of 4 variables:
## $ name : chr "Lorelai Gilmore" "Hermione Granger" "Katara" "Chewbacca" ...
## $ color : chr "Orange" "Yellow" "Blue" "Orange" ...
## $ alter : chr "Rory Gilmore" NA "Aang" NA ...
## $ hometown: chr "Stars_Hollow" "Hogwarts" "Republic_City" "Corellia" ...
We want, by the end of this lesson, to reunite our duos. To do so, we will want to create a new dataframe, where each row has a single duo pair: one column for each member, their associated color, and hometown. And example of our desired output can be seen here.
Desired Output Example
Checkpoint
In order to join these dataframes and reunite our duos, we need to choose some method to match on. For this activity, we will be using one of the key based methods we covered above. However, data is rarely in a clean and usable state from the start. Often the bulk of your time will be spent preparing the data to do the actual task you want to accomplish.
For our example, we will be using the “hometown” as our matching key; we ultimately want all of our duos to get home! However, the data gremlins have made this task more difficult, as the hometowns have various noise in them that prevents us from using them as a key. We’re going to have to repair the damage before we can reunite our duos.
I’ll walk you through one method below, but stress that this is not an “optimal” answer, it is just one of many.
Checkpoint
Spoiler Space
Don’t scroll past here until you’ve answered the above questions!
To use “hometown” as a key, we will need to standardize the capitalization and the space deliminators. Let’s take care of the capitalization first.
While we could spend time making sure all of our hometowns are properly capitalized, it isn’t really necessary in this application; we just need them to match. An easy way to do that is to set everything to upper or lower case. Let’s use lower case here, making use of the tolower function in R. It will convert any character vector into all lower case.
california$hometown = tolower(california$hometown)
massachusetts$hometown = tolower(massachusetts$hometown)
Let’s pause to look at our work …
head(california, 10)
## name color alter hometown
## 1 Rory Gilmore Yellow Lorelai Gilmore stars hollow
## 2 Harry Potter Red <NA> hogwarts
## 3 Aang Orange Katara republic city
## 4 Han Solo Blue <NA> corellia
## 5 Mandy Red Billy eville
## 6 Vash the Stampede Red <NA> a desert planet
## 7 Ash Ketchum Red Pikachu pallet town
## 8 Midge Violet <NA> new york
## 9 Link Green Zelda castle town
## 10 Batman Red <NA> gotham
Looks good. Now let’s deal with the spacing. If you look at our dataframes and compare across them, you’ll notice some hometowns have underscores in them. This seems to only happen when there would be a space. We can use that rule to replace all the underscores with a space instead. To do this, we’ll use a gsub function, or “global substitution.” It will look over all of the strings, and globally replace a pattern we specify with a substitution. In this case, we want to replace all "_" with " ".
# gsub(Pattern, replacement, X)
# pattern is what we want to replace
# replacement is what we want to replace the pattern
# X is the vector of character strings we want to run this replacement on
california$hometown = gsub(pattern = "_", replacement = " ", x = california$hometown)
massachusetts$hometown = gsub(pattern = "_", replacement = " ", x = massachusetts$hometown)
Let’s check our work …
head(california, 10)
## name color alter hometown
## 1 Rory Gilmore Yellow Lorelai Gilmore stars hollow
## 2 Harry Potter Red <NA> hogwarts
## 3 Aang Orange Katara republic city
## 4 Han Solo Blue <NA> corellia
## 5 Mandy Red Billy eville
## 6 Vash the Stampede Red <NA> a desert planet
## 7 Ash Ketchum Red Pikachu pallet town
## 8 Midge Violet <NA> new york
## 9 Link Green Zelda castle town
## 10 Batman Red <NA> gotham
Looking good! It seems we have a consistent key. Now for the ultimate task of reuniting our duos.
Unfortunately, our duos were not split evenly. Some duos both got sent to the same place, while others were sent to different places. As with many data science problems, there are a number of ways to overcome this, and no “best” answer.
Checkpoint
Spoiler Space
Don’t scroll past here until you’ve answered the above questions!
Alright, now we have two dataframes, each containing one member of a duo. Time to finally merge them back together! We can do this using the merge function we learned about previously. Let’s try an inner merge.
# assign to a new reunion dataframe
# the result of a inner merge between dataframes `duo1` and `duo2`
# using column "hometown" as the key
reunion = merge(california, massachusetts, by = "hometown")
# show results
reunion
## hometown name.x color.x alter.x
## 1 a desert planet Vash the Stampede Red <NA>
## 2 castle town Link Green Zelda
## 3 coolsville Scooby Doo Orange <NA>
## 4 corellia Han Solo Blue <NA>
## 5 eville Mandy Red Billy
## 6 gotham Batman Red <NA>
## 7 hogwarts Harry Potter Red <NA>
## 8 koriko Kiki Red JiJi
## 9 london Sherlock Blue Watson
## 10 new york Midge Violet <NA>
## 11 pallet town Ash Ketchum Red Pikachu
## 12 republic city Aang Orange Katara
## 13 stars hollow Rory Gilmore Yellow Lorelai Gilmore
## name.y color.y alter.y
## 1 Nicholas D. Wolfwood Blue <NA>
## 2 Zelda Violet Link
## 3 Shaggy Green <NA>
## 4 Chewbacca Orange <NA>
## 5 Billy Blue Mandy
## 6 Robin Yellow <NA>
## 7 Hermione Granger Yellow <NA>
## 8 JiJi Indigo Kiki
## 9 Watson Indigo Sherlock
## 10 Susie Blue <NA>
## 11 Pikachu Yellow Ash Ketchum
## 12 Katara Blue Aang
## 13 Lorelai Gilmore Orange Rory Gilmore
A bit messy still, but we did it! Each duo has been reunited in one row. Now we can clean up the results a bit, and arrive at our desired outcome. You’ll notice some column names now have a “.x” or “.y” at the end. This is because in the original duo dataframes, the columns had the same name. Since two columns can’t have the same name once we join them in a single dataframe, the merge function add these suffixes to denote which column came from which dataframe. The columns from the first dataframe in the merge function always gets the “.x” ending, while the second gets the “.y” ending.
# first let's keep only the columns we want
# assign to new dataframe `outcome` the contents of `reunion`
# SUCH THAT
# all rows
# and the columns are named
outcome = reunion[ , c("name.x", "color.x", "name.y", "color.y", "hometown")]
Lastly, we can rename our columns.
# rename columns of dataframe `outcome`
colnames(outcome) = c("mem_1", "mem_1_color", "mem_2", "mem_2_color", "hometown")
# show results
head(outcome, 10)
## mem_1 mem_1_color mem_2 mem_2_color
## 1 Vash the Stampede Red Nicholas D. Wolfwood Blue
## 2 Link Green Zelda Violet
## 3 Scooby Doo Orange Shaggy Green
## 4 Han Solo Blue Chewbacca Orange
## 5 Mandy Red Billy Blue
## 6 Batman Red Robin Yellow
## 7 Harry Potter Red Hermione Granger Yellow
## 8 Kiki Red JiJi Indigo
## 9 Sherlock Blue Watson Indigo
## 10 Midge Violet Susie Blue
## hometown
## 1 a desert planet
## 2 castle town
## 3 coolsville
## 4 corellia
## 5 eville
## 6 gotham
## 7 hogwarts
## 8 koriko
## 9 london
## 10 new york
Checkpoint
That was hopefully a fun spreadsheet-based adventure, and I’m glad it had a happy ending. But how can we use what we’ve learned in the real world? The inspiration for this lesson came from a task I needed to accomplish for a student organization I helped run, called Hack 4 California. The lab would meet weekly to work on civic data projects for our community, from small projects like helping the campus food pantry with usage statistics, and to full-fledged examinations of government data regarding exposure to hazardous materials in prisons with collaborators at UCLA.
To continue learning from the experience, my co-lead and I wanted to calculate attendance, and reach out to everyone who had participated to ask them what they thought! However, we had an issue. Because of the pandemic, half of our attendance data was in PDFs of excel sheets, and the other half was in zoom attendance logs. In total we had two PDFs, and 28 Zoom logs. You can see versions of these files with fake names below.
PDF Attendance Data
Zoom Log Attendance Data
The data formats are not ideal (Why PDFs of an excel sheet?), but such is the case with most projects. You can import versions of these dataframes to work with using the following code:
source("https://raw.githubusercontent.com/Epsian/table_join_lecture/main/src/3_real_examples.R")
We wanted to create a single dataframe that contained complete attendance data, so that we could see how many times each person attended. Using the skills you’ve learned in this lecture and the course so far, create a single dataframe which contains a row for each person, and columns containing their name, a count of how many times they attended, and their email if available.
Tips